InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models
With the rapid development of code LLMs, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the performance of code LLMs with a particular focus on code generation tasks. However, they are insufficient to cover the full range of expected capabilities of code LLMs, which span beyond code generation to answering diverse coding-related questions.
- Research Report > New Finding (0.30)
- Research Report > Experimental Study (0.30)
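Free-form coding QA benchmarks like the one described above need automatic graders rather than unit tests alone. A minimal illustrative sketch of one such approach, keyword-based answer grading (an assumption for illustration, not InfiBench's actual scoring code):

```python
# Illustrative sketch: grade a free-form answer by the fraction of
# expected keywords it contains. Function name and scheme are invented
# for illustration, not taken from InfiBench.
def keyword_score(answer: str, keywords: list[str]) -> float:
    """Return the fraction of expected keywords present in the answer."""
    answer_lower = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in answer_lower)
    return hits / len(keywords) if keywords else 0.0

print(keyword_score("Use `git rebase -i` to squash commits.",
                    ["rebase", "squash"]))  # 1.0
```

A real grader would combine such signals with unit tests or similarity scoring, since keyword presence alone rewards superficial matches.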
What's next for Chinese open-source AI
Chinese open models are spreading fast, from Hugging Face to Silicon Valley. In this photo illustration, the DeepSeek app is seen on a phone in front of a flag of China on January 28, 2025 in Hong Kong, China. The past year has marked a turning point for Chinese AI. Since DeepSeek released its R1 reasoning model in January 2025, Chinese companies have repeatedly delivered AI models that match the performance of leading Western models at a fraction of the cost. Just last week the Chinese firm Moonshot AI released its latest open-weight model, Kimi K2.5, which came close to top proprietary systems such as Anthropic's Claude Opus on some early benchmarks. The difference: K2.5 is roughly one-seventh the price of Opus.
- North America > United States > California (0.25)
- Asia > China > Hong Kong (0.25)
- South America > Brazil (0.04)
- (5 more...)
- Information Technology (1.00)
- Banking & Finance (0.95)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
Signal's Founder Built a Chatbot That Can't Spy on You
Welcome back to TIME's new twice-weekly newsletter about AI. If you're reading this in your browser, why not subscribe to have the next one delivered straight to your inbox? What to Know: Signal's founder is working on encrypted chatbots. Moxie Marlinspike, the cryptographic prodigy who wrote the code that underpins Signal and WhatsApp, has a new project--and it could be one of the most important things happening in AI right now. The tool, named Confer, is an end-to-end encrypted AI assistant. It uses smart math to ensure that even though the compute-intensive process of running the AI still happens on a server in the cloud, the only person who can access the unscrambled details of that computation is you, the user.
- North America > United States (0.05)
- Europe > France (0.05)
- Africa (0.05)
- Law (0.71)
- Information Technology > Services (0.50)
What's next for AI in 2026
Our AI writers make their big bets for the coming year--here are five hot trends to watch. In an industry in constant flux, sticking your neck out to predict what's coming next may seem reckless. But for the last few years we've done just that--and we're doing it again. How did we do last time? Here are our big bets for the next 12 months. The last year shaped up as a big one for Chinese open-source models.
- North America > United States > California (0.15)
- Asia > China (0.06)
- North America > United States > Massachusetts (0.04)
- (2 more...)
- Law (1.00)
- Information Technology (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an overly optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions deteriorates performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from scientific papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope that CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress.
Five AI Developments That Changed Everything This Year
President Donald Trump speaks in the Roosevelt Room flanked by Masayoshi Son, Larry Ellison, and Sam Altman at the White House on January 21, 2025. In case you missed it, 2025 was a big year for AI. It became an economic force, propping up the stock market, and a geopolitical pawn, redrawing the frontlines of Great Power competition. It had both global and deeply personal effects, changing the ways that we think, write, and relate.
- Asia > China (0.43)
- Europe > France (0.05)
- North America > United States > Missouri (0.05)
- (2 more...)
Large Language Model-Based Generation of Discharge Summaries
Rodrigues, Tiago, Lopes, Carla Teixeira
Discharge Summaries are documents written by medical professionals that detail a patient's visit to a care facility. They contain a wealth of information crucial for patient care, and automating their generation could significantly reduce the effort required from healthcare professionals, minimize errors, and ensure that critical patient information is easily accessible and actionable. In this work, we explore the use of five Large Language Models on this task, from open-source models (Mistral, Llama 2) to proprietary systems (GPT-3, GPT-4, Gemini 1.5 Pro), leveraging MIMIC-III summaries and notes. We evaluate them using exact-match, soft-overlap, and reference-free metrics. Our results show that proprietary models, particularly Gemini with one-shot prompting, outperformed others, producing summaries with the highest similarity to the gold-standard ones. Open-source models, while promising, especially Mistral after fine-tuning, lagged in performance, often struggling with hallucinations and repeated information. Human evaluation by a clinical expert confirmed the practical utility of the summaries generated by proprietary models. Despite the challenges, such as hallucinations and missing information, the findings suggest that LLMs, especially proprietary models, are promising candidates for automatic discharge summary generation as long as data privacy is ensured.
- North America > United States > Rhode Island > Providence County > Providence (0.04)
- Europe > Portugal > Porto > Porto (0.04)
- Africa > Ethiopia > Addis Ababa > Addis Ababa (0.04)
- (6 more...)
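The discharge-summary study above evaluates with exact-match, soft-overlap, and reference-free metrics. As a minimal sketch of what a soft-overlap metric looks like, here is an LCS-based, ROUGE-L-style recall with naive whitespace tokenization (an illustrative assumption; the paper's exact metric implementations may differ):

```python
# Sketch of a soft-overlap metric: ROUGE-L-style recall via the
# longest common subsequence (LCS). Tokenization here is plain
# whitespace splitting, an assumption for illustration.
def lcs_len(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def rouge_l_recall(reference: str, candidate: str) -> float:
    """Fraction of reference tokens covered by the LCS with the candidate."""
    ref, cand = reference.split(), candidate.split()
    return lcs_len(ref, cand) / len(ref) if ref else 0.0
```

Unlike exact match, this credits a generated summary that preserves the reference's clinical content in slightly different wording, which is why soft-overlap metrics are common for summarization.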
LLM4SFC: Sequential Function Chart Generation via Large Language Models
Glick, Ofek, Tchuiev, Vladimir, Ghoummaid, Marah, Moshkovitz, Michal, Di-Castro, Dotan
While Large Language Models (LLMs) are increasingly used for synthesizing textual PLC programming languages like Structured Text (ST) code, other IEC 61131-3 standard graphical languages like Sequential Function Charts (SFCs) remain underexplored. Generating SFCs is challenging due to their graphical nature and the ST actions embedded within them, which are not directly compatible with standard generation techniques and often lead to non-executable code that is incompatible with industrial tool-chains. In this work, we introduce LLM4SFC, the first framework to take natural-language descriptions of industrial workflows and produce executable SFCs. LLM4SFC is based on three components: (i) a reduced structured representation that captures essential topology and in-line ST while reducing textual verbosity; (ii) fine-tuning and few-shot retrieval-augmented generation (RAG) for alignment with SFC programming conventions; and (iii) a structured generation approach that prunes illegal tokens in real time to ensure compliance with the textual format of SFCs. We evaluate LLM4SFC on a dataset of real-world SFCs from automated manufacturing projects, using both open-source and proprietary LLMs. The results show that LLM4SFC reliably generates syntactically valid SFC programs, effectively bridging graphical and textual PLC languages and achieving a generation success rate of 75%-94%, paving the way for automated industrial programming.
- Asia > Middle East > Israel > Haifa District > Haifa (0.05)
- North America > United States (0.04)
- Europe > Germany > Rhineland-Palatinate > Kaiserslautern (0.04)
- Workflow (1.00)
- Research Report > New Finding (0.88)